In this project, I will be analysing the gun registration dataset and census dataset. The original source of datasets are from Census and NICS. The data from NICS (National Instant Criminal Background Check System) is used for background checks for the prospective buyer to buy firearms and/or explosives. The excel file of dataset for NICS is downloaded from nics.xlsx. The census dataset provides information of the people in all states of United states with most of the data is from the year 2016 and can be downloaded as csv file from US census data.csv.
Utilizing this dataset, I will be exploring the trend of gun registration over the years for the states of United States. Further analysis will be done to see the correlation of census data like population and income with the gun registration.
#import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300
import warnings
warnings.filterwarnings('ignore')
The first step is to analyse both the datasets, looking into the variables and features, info, first few rows using head(), and checking for the missing and duplicated data.
#loading of the census and guns datasets
df_census = pd.read_csv('/Users/Divyakaur/Desktop/investigate-dataset/ncis-and-census-data/U.S. Census Data.csv')
df_guns = pd.read_excel('/Users/Divyakaur/Desktop/investigate-dataset/ncis-and-census-data/gun_data.xlsx')
#looking into first five rows of the census data set
df_census.head()
| Fact | Fact Note | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | NaN | 4,863,300 | 741,894 | 6,931,071 | 2,988,248 | 39,250,017 | 5,540,545 | 3,576,452 | 952,065 | ... | 865454 | 6651194 | 27,862,596 | 3,051,217 | 624,594 | 8,411,808 | 7,288,000 | 1,831,102 | 5,778,708 | 585,501 |
| 1 | Population estimates base, April 1, 2010, (V2... | NaN | 4,780,131 | 710,249 | 6,392,301 | 2,916,025 | 37,254,522 | 5,029,324 | 3,574,114 | 897,936 | ... | 814195 | 6346298 | 25,146,100 | 2,763,888 | 625,741 | 8,001,041 | 6,724,545 | 1,853,011 | 5,687,289 | 563,767 |
| 2 | Population, percent change - April 1, 2010 (es... | NaN | 1.70% | 4.50% | 8.40% | 2.50% | 5.40% | 10.20% | 0.10% | 6.00% | ... | 0.063 | 0.048 | 10.80% | 10.40% | -0.20% | 5.10% | 8.40% | -1.20% | 1.60% | 3.90% |
| 3 | Population, Census, April 1, 2010 | NaN | 4,779,736 | 710,231 | 6,392,017 | 2,915,918 | 37,253,956 | 5,029,196 | 3,574,097 | 897,934 | ... | 814180 | 6346105 | 25,145,561 | 2,763,885 | 625,741 | 8,001,024 | 6,724,540 | 1,852,994 | 5,686,986 | 563,626 |
| 4 | Persons under 5 years, percent, July 1, 2016, ... | NaN | 6.00% | 7.30% | 6.30% | 6.40% | 6.30% | 6.10% | 5.20% | 5.80% | ... | 0.071 | 0.061 | 7.20% | 8.30% | 4.90% | 6.10% | 6.20% | 5.50% | 5.80% | 6.50% |
5 rows × 52 columns
#looking into the shape of the census dataset
df_census.shape
(85, 52)
#looking into different columns in the dataset
df_census.columns
Index(['Fact', 'Fact Note', 'Alabama', 'Alaska', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia',
'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'],
dtype='object')
#looking for missing values
df_census.isnull().sum()
Fact 5 Fact Note 57 Alabama 20 Alaska 20 Arizona 20 Arkansas 20 California 20 Colorado 20 Connecticut 20 Delaware 20 Florida 20 Georgia 20 Hawaii 20 Idaho 20 Illinois 20 Indiana 20 Iowa 20 Kansas 20 Kentucky 20 Louisiana 20 Maine 20 Maryland 20 Massachusetts 20 Michigan 20 Minnesota 20 Mississippi 20 Missouri 20 Montana 20 Nebraska 20 Nevada 20 New Hampshire 20 New Jersey 20 New Mexico 20 New York 20 North Carolina 20 North Dakota 20 Ohio 20 Oklahoma 20 Oregon 20 Pennsylvania 20 Rhode Island 20 South Carolina 20 South Dakota 20 Tennessee 20 Texas 20 Utah 20 Vermont 20 Virginia 20 Washington 20 West Virginia 20 Wisconsin 20 Wyoming 20 dtype: int64
#looking for duplicated values
df_census.duplicated().sum()
3
#looking for the types of each column
df_census.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 85 entries, 0 to 84 Data columns (total 52 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Fact 80 non-null object 1 Fact Note 28 non-null object 2 Alabama 65 non-null object 3 Alaska 65 non-null object 4 Arizona 65 non-null object 5 Arkansas 65 non-null object 6 California 65 non-null object 7 Colorado 65 non-null object 8 Connecticut 65 non-null object 9 Delaware 65 non-null object 10 Florida 65 non-null object 11 Georgia 65 non-null object 12 Hawaii 65 non-null object 13 Idaho 65 non-null object 14 Illinois 65 non-null object 15 Indiana 65 non-null object 16 Iowa 65 non-null object 17 Kansas 65 non-null object 18 Kentucky 65 non-null object 19 Louisiana 65 non-null object 20 Maine 65 non-null object 21 Maryland 65 non-null object 22 Massachusetts 65 non-null object 23 Michigan 65 non-null object 24 Minnesota 65 non-null object 25 Mississippi 65 non-null object 26 Missouri 65 non-null object 27 Montana 65 non-null object 28 Nebraska 65 non-null object 29 Nevada 65 non-null object 30 New Hampshire 65 non-null object 31 New Jersey 65 non-null object 32 New Mexico 65 non-null object 33 New York 65 non-null object 34 North Carolina 65 non-null object 35 North Dakota 65 non-null object 36 Ohio 65 non-null object 37 Oklahoma 65 non-null object 38 Oregon 65 non-null object 39 Pennsylvania 65 non-null object 40 Rhode Island 65 non-null object 41 South Carolina 65 non-null object 42 South Dakota 65 non-null object 43 Tennessee 65 non-null object 44 Texas 65 non-null object 45 Utah 65 non-null object 46 Vermont 65 non-null object 47 Virginia 65 non-null object 48 Washington 65 non-null object 49 West Virginia 65 non-null object 50 Wisconsin 65 non-null object 51 Wyoming 65 non-null object dtypes: object(52) memory usage: 34.7+ KB
#Looking into the rows of the Fact column of census datatset
df_census.Fact.unique()
array(['Population estimates, July 1, 2016, (V2016)',
'Population estimates base, April 1, 2010, (V2016)',
'Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)',
'Population, Census, April 1, 2010',
'Persons under 5 years, percent, July 1, 2016, (V2016)',
'Persons under 5 years, percent, April 1, 2010',
'Persons under 18 years, percent, July 1, 2016, (V2016)',
'Persons under 18 years, percent, April 1, 2010',
'Persons 65 years and over, percent, July 1, 2016, (V2016)',
'Persons 65 years and over, percent, April 1, 2010',
'Female persons, percent, July 1, 2016, (V2016)',
'Female persons, percent, April 1, 2010',
'White alone, percent, July 1, 2016, (V2016)',
'Black or African American alone, percent, July 1, 2016, (V2016)',
'American Indian and Alaska Native alone, percent, July 1, 2016, (V2016)',
'Asian alone, percent, July 1, 2016, (V2016)',
'Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)',
'Two or More Races, percent, July 1, 2016, (V2016)',
'Hispanic or Latino, percent, July 1, 2016, (V2016)',
'White alone, not Hispanic or Latino, percent, July 1, 2016, (V2016)',
'Veterans, 2011-2015', 'Foreign born persons, percent, 2011-2015',
'Housing units, July 1, 2016, (V2016)',
'Housing units, April 1, 2010',
'Owner-occupied housing unit rate, 2011-2015',
'Median value of owner-occupied housing units, 2011-2015',
'Median selected monthly owner costs -with a mortgage, 2011-2015',
'Median selected monthly owner costs -without a mortgage, 2011-2015',
'Median gross rent, 2011-2015', 'Building permits, 2016',
'Households, 2011-2015', 'Persons per household, 2011-2015',
'Living in same house 1 year ago, percent of persons age 1 year+, 2011-2015',
'Language other than English spoken at home, percent of persons age 5 years+, 2011-2015',
'High school graduate or higher, percent of persons age 25 years+, 2011-2015',
"Bachelor's degree or higher, percent of persons age 25 years+, 2011-2015",
'With a disability, under age 65 years, percent, 2011-2015',
'Persons without health insurance, under age 65 years, percent',
'In civilian labor force, total, percent of population age 16 years+, 2011-2015',
'In civilian labor force, female, percent of population age 16 years+, 2011-2015',
'Total accommodation and food services sales, 2012 ($1,000)',
'Total health care and social assistance receipts/revenue, 2012 ($1,000)',
'Total manufacturers shipments, 2012 ($1,000)',
'Total merchant wholesaler sales, 2012 ($1,000)',
'Total retail sales, 2012 ($1,000)',
'Total retail sales per capita, 2012',
'Mean travel time to work (minutes), workers age 16 years+, 2011-2015',
'Median household income (in 2015 dollars), 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015',
'Persons in poverty, percent',
'Total employer establishments, 2015', 'Total employment, 2015',
'Total annual payroll, 2015 ($1,000)',
'Total employment, percent change, 2014-2015',
'Total nonemployer establishments, 2015', 'All firms, 2012',
'Men-owned firms, 2012', 'Women-owned firms, 2012',
'Minority-owned firms, 2012', 'Nonminority-owned firms, 2012',
'Veteran-owned firms, 2012', 'Nonveteran-owned firms, 2012',
'Population per square mile, 2010',
'Land area in square miles, 2010', 'FIPS Code', nan,
'NOTE: FIPS Code values are enclosed in quotes to ensure leading zeros remain intact.',
'Value Notes', '1', 'Fact Notes', '(a)', '(b)', '(c)',
'Value Flags', '-', 'D', 'F', 'FN', 'S', 'X', 'Z'], dtype=object)
df_census['Fact Note'].unique()
array([nan, '(a)', '(b)', '(c)',
'Includes data not distributed by county.',
'Includes persons reporting only one race',
'Hispanics may be of any race, so also are included in applicable race categories',
'Economic Census - Puerto Rico data are not comparable to U.S. Economic Census data',
'Either no or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest or upper interval of an open ended distribution.',
'Suppressed to avoid disclosure of confidential information',
'Fewer than 25 firms', 'Footnote on this item in place of data',
'Not available', 'Suppressed; does not meet publication standards',
'Not applicable',
'Value greater than zero but less than half unit of measure shown'],
dtype=object)
Based on the initial observation, the census data has 85 rows and 52 columns. Column 1 and column 2 are Fact and Fact Notes whereas column 3 to 52 are states of United states. The data has 3 duplicated columns and around 20 missing values.
More detailed information can be found in the website Census.
# looking into the dataset using head
df_guns.head()
| month | state | permit | permit_recheck | handgun | long_gun | other | multiple | admin | prepawn_handgun | ... | returned_other | rentals_handgun | rentals_long_gun | private_sale_handgun | private_sale_long_gun | private_sale_other | return_to_seller_handgun | return_to_seller_long_gun | return_to_seller_other | totals | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-09 | Alabama | 16717.0 | 0.0 | 5734.0 | 6320.0 | 221.0 | 317 | 0.0 | 15.0 | ... | 0.0 | 0.0 | 0.0 | 9.0 | 16.0 | 3.0 | 0.0 | 0.0 | 3.0 | 32019 |
| 1 | 2017-09 | Alaska | 209.0 | 2.0 | 2320.0 | 2930.0 | 219.0 | 160 | 0.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 17.0 | 24.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6303 |
| 2 | 2017-09 | Arizona | 5069.0 | 382.0 | 11063.0 | 7946.0 | 920.0 | 631 | 0.0 | 13.0 | ... | 0.0 | 0.0 | 0.0 | 38.0 | 12.0 | 2.0 | 0.0 | 0.0 | 0.0 | 28394 |
| 3 | 2017-09 | Arkansas | 2935.0 | 632.0 | 4347.0 | 6063.0 | 165.0 | 366 | 51.0 | 12.0 | ... | 0.0 | 0.0 | 0.0 | 13.0 | 23.0 | 0.0 | 0.0 | 2.0 | 1.0 | 17747 |
| 4 | 2017-09 | California | 57839.0 | 0.0 | 37165.0 | 24581.0 | 2984.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 123506 |
5 rows × 27 columns
#looking into the number of rows and columns in the dataframe
df_guns.shape
(12485, 27)
#looking into the columns
df_guns.columns
Index(['month', 'state', 'permit', 'permit_recheck', 'handgun', 'long_gun',
'other', 'multiple', 'admin', 'prepawn_handgun', 'prepawn_long_gun',
'prepawn_other', 'redemption_handgun', 'redemption_long_gun',
'redemption_other', 'returned_handgun', 'returned_long_gun',
'returned_other', 'rentals_handgun', 'rentals_long_gun',
'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
'return_to_seller_handgun', 'return_to_seller_long_gun',
'return_to_seller_other', 'totals'],
dtype='object')
# looking into the missing values
df_guns.isnull().sum()
month 0 state 0 permit 24 permit_recheck 11385 handgun 20 long_gun 19 other 6985 multiple 0 admin 23 prepawn_handgun 1943 prepawn_long_gun 1945 prepawn_other 7370 redemption_handgun 1940 redemption_long_gun 1941 redemption_other 7370 returned_handgun 10285 returned_long_gun 10340 returned_other 10670 rentals_handgun 11495 rentals_long_gun 11660 private_sale_handgun 9735 private_sale_long_gun 9735 private_sale_other 9735 return_to_seller_handgun 10010 return_to_seller_long_gun 9735 return_to_seller_other 10230 totals 0 dtype: int64
#looking into the duplicated values
df_guns.duplicated().sum()
0
Based on the initial observation, the guns registration data has 12485 rows and 27 columns. Each column permit, permit recheck represents the type of transaction submitted to the National Instant Criminal Background Check System (NICS). The NICS Firearm Background Checks happened every month and Year for each of the states.
The transactions are divided based on the type of firearm - handgun, long gun and other (refers to frames, receivers and other firearm that are neither handguns nor long guns. Multiple denotes background check for more than one type of firearm background check while admin denotes the administrative checks.
Other transactions corresponding to handgun, long gun and other are:
Note: All these are background checks requested by official-licenced Federal Firearms Licensee (FFL) or criminal law enforcement agent prior to the issuance of a firearm-related permit or transfer. Much of the detailed information can be found from this website NICS_details.
#looking for duplicated data set
df_census[df_census.duplicated()]
| Fact | Fact Note | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 70 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 rows × 52 columns
As we can see, all the rows are filled with NaN, so we will drop these rows and will look into the data to confirm that only three rows that are duplicated are removed.
#drop duplicated rows
df_census.drop_duplicates(inplace=True)
df_census.shape
(82, 52)
Next thing to point out about the dataset is the type of each column. All the datatypes are objects and column states has string characters in it. Therefore, except Fact and Fact Note column which should be object type, all other columns should be numeric. Therefore, unnecessary string characters has other characters like $ and , including quotes needs to be removed before changing it into numeric datatype.
Also, as we looked into "Fact Note" column, most of the rows are filled with NaN and there are few notes corresponding to it. Since only fact column is neccessary to define the data, Fact Note is removed.
#looking into dataset
df_census.head()
| Fact | Fact Note | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | NaN | 4,863,300 | 741,894 | 6,931,071 | 2,988,248 | 39,250,017 | 5,540,545 | 3,576,452 | 952,065 | ... | 865454 | 6651194 | 27,862,596 | 3,051,217 | 624,594 | 8,411,808 | 7,288,000 | 1,831,102 | 5,778,708 | 585,501 |
| 1 | Population estimates base, April 1, 2010, (V2... | NaN | 4,780,131 | 710,249 | 6,392,301 | 2,916,025 | 37,254,522 | 5,029,324 | 3,574,114 | 897,936 | ... | 814195 | 6346298 | 25,146,100 | 2,763,888 | 625,741 | 8,001,041 | 6,724,545 | 1,853,011 | 5,687,289 | 563,767 |
| 2 | Population, percent change - April 1, 2010 (es... | NaN | 1.70% | 4.50% | 8.40% | 2.50% | 5.40% | 10.20% | 0.10% | 6.00% | ... | 0.063 | 0.048 | 10.80% | 10.40% | -0.20% | 5.10% | 8.40% | -1.20% | 1.60% | 3.90% |
| 3 | Population, Census, April 1, 2010 | NaN | 4,779,736 | 710,231 | 6,392,017 | 2,915,918 | 37,253,956 | 5,029,196 | 3,574,097 | 897,934 | ... | 814180 | 6346105 | 25,145,561 | 2,763,885 | 625,741 | 8,001,024 | 6,724,540 | 1,852,994 | 5,686,986 | 563,626 |
| 4 | Persons under 5 years, percent, July 1, 2016, ... | NaN | 6.00% | 7.30% | 6.30% | 6.40% | 6.30% | 6.10% | 5.20% | 5.80% | ... | 0.071 | 0.061 | 7.20% | 8.30% | 4.90% | 6.10% | 6.20% | 5.50% | 5.80% | 6.50% |
5 rows × 52 columns
#dropping Fact Note column from the census dataset
df_census = df_census.drop(columns="Fact Note", axis=1)
df_census
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4,863,300 | 741,894 | 6,931,071 | 2,988,248 | 39,250,017 | 5,540,545 | 3,576,452 | 952,065 | 20,612,439 | ... | 865454 | 6651194 | 27,862,596 | 3,051,217 | 624,594 | 8,411,808 | 7,288,000 | 1,831,102 | 5,778,708 | 585,501 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4,780,131 | 710,249 | 6,392,301 | 2,916,025 | 37,254,522 | 5,029,324 | 3,574,114 | 897,936 | 18,804,592 | ... | 814195 | 6346298 | 25,146,100 | 2,763,888 | 625,741 | 8,001,041 | 6,724,545 | 1,853,011 | 5,687,289 | 563,767 |
| 2 | Population, percent change - April 1, 2010 (es... | 1.70% | 4.50% | 8.40% | 2.50% | 5.40% | 10.20% | 0.10% | 6.00% | 9.60% | ... | 0.063 | 0.048 | 10.80% | 10.40% | -0.20% | 5.10% | 8.40% | -1.20% | 1.60% | 3.90% |
| 3 | Population, Census, April 1, 2010 | 4,779,736 | 710,231 | 6,392,017 | 2,915,918 | 37,253,956 | 5,029,196 | 3,574,097 | 897,934 | 18,801,310 | ... | 814180 | 6346105 | 25,145,561 | 2,763,885 | 625,741 | 8,001,024 | 6,724,540 | 1,852,994 | 5,686,986 | 563,626 |
| 4 | Persons under 5 years, percent, July 1, 2016, ... | 6.00% | 7.30% | 6.30% | 6.40% | 6.30% | 6.10% | 5.20% | 5.80% | 5.50% | ... | 0.071 | 0.061 | 7.20% | 8.30% | 4.90% | 6.10% | 6.20% | 5.50% | 5.80% | 6.50% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | FN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 81 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82 | S | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 83 | X | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 84 | Z | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
82 rows × 51 columns
Next thing is to look for missing data.
# Looking into missing data
df_census.isnull().sum()
Fact 2 Alabama 17 Alaska 17 Arizona 17 Arkansas 17 California 17 Colorado 17 Connecticut 17 Delaware 17 Florida 17 Georgia 17 Hawaii 17 Idaho 17 Illinois 17 Indiana 17 Iowa 17 Kansas 17 Kentucky 17 Louisiana 17 Maine 17 Maryland 17 Massachusetts 17 Michigan 17 Minnesota 17 Mississippi 17 Missouri 17 Montana 17 Nebraska 17 Nevada 17 New Hampshire 17 New Jersey 17 New Mexico 17 New York 17 North Carolina 17 North Dakota 17 Ohio 17 Oklahoma 17 Oregon 17 Pennsylvania 17 Rhode Island 17 South Carolina 17 South Dakota 17 Tennessee 17 Texas 17 Utah 17 Vermont 17 Virginia 17 Washington 17 West Virginia 17 Wisconsin 17 Wyoming 17 dtype: int64
# Looking into missing data
df_census[df_census.isnull()]
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 80 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 81 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 83 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
82 rows × 51 columns
Looking into the missing values, there are in total 17 rows of states that are filled with null values and therefore are removed.
# removing null values from the census dataset
df_census = df_census.dropna()
df_census.shape
(65, 51)
The number of rows and columns left after removing the missing and duplicated values are now 65 rows and 51 columns.
Next thing I am working on is removing the unncessary characters and converting to numeric type for all the columns except the Fact column (which is a description). Other columns are representing the population data and the percentage that must be in numeric form.
# converting to numeric data
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas',
'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',
'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts',
'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana',
'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia',
'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
characters = ['%', ',', '$']
for col in states:
for char in characters:
df_census[col] = df_census[col].apply(lambda x: str(x).replace(char, ''))
# removing quotes from FIPS code row
#df_census[col] = df_census[col].str.replace(r'[^0-9]+', '')
df_census.head(5)
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300 | 741894 | 6931071 | 2988248 | 39250017 | 5540545 | 3576452 | 952065 | 20612439 | ... | 865454 | 6651194 | 27862596 | 3051217 | 624594 | 8411808 | 7288000 | 1831102 | 5778708 | 585501 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131 | 710249 | 6392301 | 2916025 | 37254522 | 5029324 | 3574114 | 897936 | 18804592 | ... | 814195 | 6346298 | 25146100 | 2763888 | 625741 | 8001041 | 6724545 | 1853011 | 5687289 | 563767 |
| 2 | Population, percent change - April 1, 2010 (es... | 1.70 | 4.50 | 8.40 | 2.50 | 5.40 | 10.20 | 0.10 | 6.00 | 9.60 | ... | 0.063 | 0.048 | 10.80 | 10.40 | -0.20 | 5.10 | 8.40 | -1.20 | 1.60 | 3.90 |
| 3 | Population, Census, April 1, 2010 | 4779736 | 710231 | 6392017 | 2915918 | 37253956 | 5029196 | 3574097 | 897934 | 18801310 | ... | 814180 | 6346105 | 25145561 | 2763885 | 625741 | 8001024 | 6724540 | 1852994 | 5686986 | 563626 |
| 4 | Persons under 5 years, percent, July 1, 2016, ... | 6.00 | 7.30 | 6.30 | 6.40 | 6.30 | 6.10 | 5.20 | 5.80 | 5.50 | ... | 0.071 | 0.061 | 7.20 | 8.30 | 4.90 | 6.10 | 6.20 | 5.50 | 5.80 | 6.50 |
5 rows × 51 columns
df_census.head(1)
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300 | 741894 | 6931071 | 2988248 | 39250017 | 5540545 | 3576452 | 952065 | 20612439 | ... | 865454 | 6651194 | 27862596 | 3051217 | 624594 | 8411808 | 7288000 | 1831102 | 5778708 | 585501 |
1 rows × 51 columns
# changing to numeric form
#df_census[col] = df_census[col].apply(lambda x: str(x).replace(char, ''))
#[pd.to_numeric(df_census[col]) for col in states]
While attempting to change the datatype to numeric above, it runs an error showing: "ValueError: Unable to parse string ""01"" at position 64", let us closely look into row 64 and see how it looks like.
Note:to avoid the cell showing error, I commented out.
#looking into 64 row and all columns
df_census.loc[64,:]
Fact FIPS Code Alabama "01" Alaska "02" Arizona "04" Arkansas "05" California "06" Colorado "08" Connecticut "09" Delaware "10" Florida "12" Georgia "13" Hawaii "15" Idaho "16" Illinois "17" Indiana "18" Iowa "19" Kansas "20" Kentucky "21" Louisiana "22" Maine "23" Maryland "24" Massachusetts "25" Michigan "26" Minnesota "27" Mississippi "28" Missouri "29" Montana "30" Nebraska "31" Nevada "32" New Hampshire "33" New Jersey "34" New Mexico "35" New York "36" North Carolina "37" North Dakota "38" Ohio "39" Oklahoma "40" Oregon "41" Pennsylvania "42" Rhode Island "44" South Carolina "45" South Dakota "46" Tennessee "47" Texas "48" Utah "49" Vermont "50" Virginia "51" Washington "53" West Virginia "54" Wisconsin "55" Wyoming "56" Name: 64, dtype: object
FIPS code in row 64 has a quotes due to which the data could not convert in the numeric form. Because the analysis I am working on involves information about the population data and not the code, this row is removed from the dataset.
#removing the FIPS code row from the dataset
df_census = df_census[~(df_census["Fact"] == "FIPS Code")]
df_census
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300 | 741894 | 6931071 | 2988248 | 39250017 | 5540545 | 3576452 | 952065 | 20612439 | ... | 865454 | 6651194 | 27862596 | 3051217 | 624594 | 8411808 | 7288000 | 1831102 | 5778708 | 585501 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131 | 710249 | 6392301 | 2916025 | 37254522 | 5029324 | 3574114 | 897936 | 18804592 | ... | 814195 | 6346298 | 25146100 | 2763888 | 625741 | 8001041 | 6724545 | 1853011 | 5687289 | 563767 |
| 2 | Population, percent change - April 1, 2010 (es... | 1.70 | 4.50 | 8.40 | 2.50 | 5.40 | 10.20 | 0.10 | 6.00 | 9.60 | ... | 0.063 | 0.048 | 10.80 | 10.40 | -0.20 | 5.10 | 8.40 | -1.20 | 1.60 | 3.90 |
| 3 | Population, Census, April 1, 2010 | 4779736 | 710231 | 6392017 | 2915918 | 37253956 | 5029196 | 3574097 | 897934 | 18801310 | ... | 814180 | 6346105 | 25145561 | 2763885 | 625741 | 8001024 | 6724540 | 1852994 | 5686986 | 563626 |
| 4 | Persons under 5 years, percent, July 1, 2016, ... | 6.00 | 7.30 | 6.30 | 6.40 | 6.30 | 6.10 | 5.20 | 5.80 | 5.50 | ... | 0.071 | 0.061 | 7.20 | 8.30 | 4.90 | 6.10 | 6.20 | 5.50 | 5.80 | 6.50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59 | Nonminority-owned firms, 2012 | 272651 | 51147 | 344981 | 189029 | 1819107 | 442365 | 259614 | 54782 | 1121749 | ... | 74228 | 434025 | 1224845 | 218826 | 70491 | 450109 | 426697 | 104785 | 379934 | 55397 |
| 60 | Veteran-owned firms, 2012 | 41943 | 7953 | 46780 | 25915 | 252377 | 51722 | 31056 | 7206 | 185756 | ... | 8604 | 59379 | 213590 | 18754 | 8237 | 76434 | 49331 | 12912 | 39830 | 6470 |
| 61 | Nonveteran-owned firms, 2012 | 316984 | 56091 | 427582 | 192988 | 3176341 | 469524 | 281182 | 60318 | 1846686 | ... | 66219 | 469392 | 2057218 | 219807 | 63317 | 548439 | 461401 | 94960 | 370755 | 51353 |
| 62 | Population per square mile, 2010 | 94.4 | 1.2 | 56.3 | 56 | 239.1 | 48.5 | 738.1 | 460.8 | 350.6 | ... | 10.7 | 153.9 | 96.3 | 33.6 | 67.9 | 202.6 | 101.2 | 77.1 | 105 | 5.8 |
| 63 | Land area in square miles, 2010 | 50645.33 | 570640.95 | 113594.08 | 52035.48 | 155779.22 | 103641.89 | 4842.36 | 1948.54 | 53624.76 | ... | 75811 | 41234.9 | 261231.71 | 82169.62 | 9216.66 | 39490.09 | 66455.52 | 24038.21 | 54157.80 | 97093.14 |
64 rows × 51 columns
Let us further give a close look into the "Fact" column and see what rows are required for the analysis.¶
#looking into the Fact column
df_census.Fact.unique()
array(['Population estimates, July 1, 2016, (V2016)',
'Population estimates base, April 1, 2010, (V2016)',
'Population, percent change - April 1, 2010 (estimates base) to July 1, 2016, (V2016)',
'Population, Census, April 1, 2010',
'Persons under 5 years, percent, July 1, 2016, (V2016)',
'Persons under 5 years, percent, April 1, 2010',
'Persons under 18 years, percent, July 1, 2016, (V2016)',
'Persons under 18 years, percent, April 1, 2010',
'Persons 65 years and over, percent, July 1, 2016, (V2016)',
'Persons 65 years and over, percent, April 1, 2010',
'Female persons, percent, July 1, 2016, (V2016)',
'Female persons, percent, April 1, 2010',
'White alone, percent, July 1, 2016, (V2016)',
'Black or African American alone, percent, July 1, 2016, (V2016)',
'American Indian and Alaska Native alone, percent, July 1, 2016, (V2016)',
'Asian alone, percent, July 1, 2016, (V2016)',
'Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)',
'Two or More Races, percent, July 1, 2016, (V2016)',
'Hispanic or Latino, percent, July 1, 2016, (V2016)',
'White alone, not Hispanic or Latino, percent, July 1, 2016, (V2016)',
'Veterans, 2011-2015', 'Foreign born persons, percent, 2011-2015',
'Housing units, July 1, 2016, (V2016)',
'Housing units, April 1, 2010',
'Owner-occupied housing unit rate, 2011-2015',
'Median value of owner-occupied housing units, 2011-2015',
'Median selected monthly owner costs -with a mortgage, 2011-2015',
'Median selected monthly owner costs -without a mortgage, 2011-2015',
'Median gross rent, 2011-2015', 'Building permits, 2016',
'Households, 2011-2015', 'Persons per household, 2011-2015',
'Living in same house 1 year ago, percent of persons age 1 year+, 2011-2015',
'Language other than English spoken at home, percent of persons age 5 years+, 2011-2015',
'High school graduate or higher, percent of persons age 25 years+, 2011-2015',
"Bachelor's degree or higher, percent of persons age 25 years+, 2011-2015",
'With a disability, under age 65 years, percent, 2011-2015',
'Persons without health insurance, under age 65 years, percent',
'In civilian labor force, total, percent of population age 16 years+, 2011-2015',
'In civilian labor force, female, percent of population age 16 years+, 2011-2015',
'Total accommodation and food services sales, 2012 ($1,000)',
'Total health care and social assistance receipts/revenue, 2012 ($1,000)',
'Total manufacturers shipments, 2012 ($1,000)',
'Total merchant wholesaler sales, 2012 ($1,000)',
'Total retail sales, 2012 ($1,000)',
'Total retail sales per capita, 2012',
'Mean travel time to work (minutes), workers age 16 years+, 2011-2015',
'Median household income (in 2015 dollars), 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015',
'Persons in poverty, percent',
'Total employer establishments, 2015', 'Total employment, 2015',
'Total annual payroll, 2015 ($1,000)',
'Total employment, percent change, 2014-2015',
'Total nonemployer establishments, 2015', 'All firms, 2012',
'Men-owned firms, 2012', 'Women-owned firms, 2012',
'Minority-owned firms, 2012', 'Nonminority-owned firms, 2012',
'Veteran-owned firms, 2012', 'Nonveteran-owned firms, 2012',
'Population per square mile, 2010',
'Land area in square miles, 2010'], dtype=object)
One of the question that is explored from this dataset is looking into the population in different states for the year 2010 and 2016 and its correlation with the NICS dataset.
Other question is impact of income (median household, per capita) on gun registration dataset. Hence, I will be selecting the rows that highlight those data.
#selection of the rows and saving the dataframe having these rows
census = ['Population estimates, July 1, 2016, (V2016)',
'Population estimates base, April 1, 2010, (V2016)',
'Persons under 18 years, percent, July 1, 2016, (V2016)',
'Persons under 18 years, percent, April 1, 2010',
'Persons 65 years and over, percent, July 1, 2016, (V2016)',
'Persons 65 years and over, percent, April 1, 2010',
'White alone, percent, July 1, 2016, (V2016)',
'Black or African American alone, percent, July 1, 2016, (V2016)',
'American Indian and Alaska Native alone, percent, July 1, 2016, (V2016)',
'Asian alone, percent, July 1, 2016, (V2016)',
'Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)',
'Two or More Races, percent, July 1, 2016, (V2016)',
'Hispanic or Latino, percent, July 1, 2016, (V2016)',
'White alone, not Hispanic or Latino, percent, July 1, 2016, (V2016)',
'High school graduate or higher, percent of persons age 25 years+, 2011-2015',
"Bachelor's degree or higher, percent of persons age 25 years+, 2011-2015",
'Median household income (in 2015 dollars), 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015']
census = df_census.query('Fact in @census')
census.Fact.unique()
array(['Population estimates, July 1, 2016, (V2016)',
'Population estimates base, April 1, 2010, (V2016)',
'Persons under 18 years, percent, July 1, 2016, (V2016)',
'Persons under 18 years, percent, April 1, 2010',
'Persons 65 years and over, percent, July 1, 2016, (V2016)',
'Persons 65 years and over, percent, April 1, 2010',
'White alone, percent, July 1, 2016, (V2016)',
'Black or African American alone, percent, July 1, 2016, (V2016)',
'American Indian and Alaska Native alone, percent, July 1, 2016, (V2016)',
'Asian alone, percent, July 1, 2016, (V2016)',
'Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)',
'Two or More Races, percent, July 1, 2016, (V2016)',
'Hispanic or Latino, percent, July 1, 2016, (V2016)',
'White alone, not Hispanic or Latino, percent, July 1, 2016, (V2016)',
'High school graduate or higher, percent of persons age 25 years+, 2011-2015',
"Bachelor's degree or higher, percent of persons age 25 years+, 2011-2015",
'Median household income (in 2015 dollars), 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015'],
dtype=object)
# changing all the states data to numeric form
#[pd.to_numeric(census[col]) for col in states]
While attempting again to change the datatype to numeric above, it runs an error showing: "ValueError: Unable to parse string "Z" at position 10", let us closely look into row 10 and see how the data looks like. There was another value error showing "ValueError: Unable to parse string "D" at position 42" which is also fixed.
Note:to avoid the cell showing error above, I commented out.
#Looking into the 10th row
census.iloc[10]
Fact Native Hawaiian and Other Pacific Islander alo... Alabama 0.10 Alaska 1.30 Arizona 0.30 Arkansas 0.30 California 0.50 Colorado 0.20 Connecticut 0.10 Delaware 0.10 Florida 0.10 Georgia 0.10 Hawaii 10.20 Idaho 0.20 Illinois 0.10 Indiana 0.10 Iowa 0.10 Kansas 0.10 Kentucky 0.10 Louisiana 0.10 Maine Z Maryland 0.10 Massachusetts 0.10 Michigan Z Minnesota 0.10 Mississippi 0.10 Missouri 0.10 Montana 0.10 Nebraska 0.10 Nevada 0.80 New Hampshire 0.10 New Jersey 0.10 New Mexico 0.002 New York 0.001 North Carolina 0.001 North Dakota 0.001 Ohio 0.001 Oklahoma 0.002 Oregon 0.004 Pennsylvania 0.001 Rhode Island 0.002 South Carolina 0.001 South Dakota 0.001 Tennessee 0.001 Texas 0.10 Utah 1.00 Vermont Z Virginia 0.10 Washington 0.80 West Virginia Z Wisconsin 0.10 Wyoming 0.10 Name: 16, dtype: object
It shows that there is a string Z in the 10th row for those states whose population in percentage is unknown. I will be changing to 0 as we do not have data for the percentage of the Native Hawaiin Race corresponding to four states Maine, Michigan, Vermont and West Virginia.
#changing Z and D string to zero
#census.loc[:, ["Maine", "Michigan","Vermont","West Virginia"]].replace('Z',0)
census = census.replace(["Z", "D"],[0,0])
census
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300 | 741894 | 6931071 | 2988248 | 39250017 | 5540545 | 3576452 | 952065 | 20612439 | ... | 865454 | 6651194 | 27862596 | 3051217 | 624594 | 8411808 | 7288000 | 1831102 | 5778708 | 585501 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131 | 710249 | 6392301 | 2916025 | 37254522 | 5029324 | 3574114 | 897936 | 18804592 | ... | 814195 | 6346298 | 25146100 | 2763888 | 625741 | 8001041 | 6724545 | 1853011 | 5687289 | 563767 |
| 6 | Persons under 18 years, percent, July 1, 2016,... | 22.60 | 25.20 | 23.50 | 23.60 | 23.20 | 22.80 | 21.10 | 21.50 | 20.10 | ... | 0.246 | 0.226 | 26.20 | 30.20 | 19.00 | 22.20 | 22.40 | 20.50 | 22.30 | 23.70 |
| 7 | Persons under 18 years, percent, April 1, 2010 | 23.70 | 26.40 | 25.50 | 24.40 | 25.00 | 24.40 | 22.90 | 22.90 | 21.30 | ... | 0.249 | 0.236 | 27.30 | 31.50 | 20.70 | 23.20 | 23.50 | 20.90 | 23.60 | 24.00 |
| 8 | Persons 65 years and over, percent, July 1, 2... | 16.10 | 10.40 | 16.90 | 16.30 | 13.60 | 13.40 | 16.10 | 17.50 | 19.90 | ... | 0.16 | 0.157 | 12.00 | 10.50 | 18.10 | 14.60 | 14.80 | 18.80 | 16.10 | 15.00 |
| 9 | Persons 65 years and over, percent, April 1, 2010 | 13.80 | 7.70 | 13.80 | 14.40 | 11.40 | 10.90 | 14.20 | 14.40 | 17.30 | ... | 0.143 | 0.134 | 10.30 | 9.00 | 14.60 | 12.20 | 12.30 | 16.00 | 13.70 | 12.40 |
| 12 | White alone, percent, July 1, 2016, (V2016) | 69.30 | 66.10 | 83.30 | 79.40 | 72.70 | 87.50 | 80.60 | 70.10 | 77.60 | ... | 0.852 | 0.787 | 79.40 | 91.10 | 94.60 | 70.00 | 80.00 | 93.60 | 87.50 | 92.80 |
| 13 | Black or African American alone, percent, July... | 26.80 | 3.80 | 4.90 | 15.70 | 6.50 | 4.50 | 11.80 | 22.60 | 16.80 | ... | 0.02 | 0.171 | 12.60 | 1.40 | 1.30 | 19.80 | 4.10 | 3.60 | 6.60 | 1.30 |
| 14 | American Indian and Alaska Native alone, perce... | 0.70 | 15.20 | 5.40 | 1.00 | 1.70 | 1.60 | 0.50 | 0.60 | 0.50 | ... | 0.09 | 0.004 | 1.00 | 1.60 | 0.40 | 0.50 | 1.90 | 0.20 | 1.10 | 2.70 |
| 15 | Asian alone, percent, July 1, 2016, (V2016) | 1.40 | 6.30 | 3.40 | 1.60 | 14.80 | 3.30 | 4.70 | 4.00 | 2.90 | ... | 0.015 | 0.018 | 4.80 | 2.50 | 1.80 | 6.60 | 8.60 | 0.80 | 2.80 | 1.00 |
| 16 | Native Hawaiian and Other Pacific Islander alo... | 0.10 | 1.30 | 0.30 | 0.30 | 0.50 | 0.20 | 0.10 | 0.10 | 0.10 | ... | 0.001 | 0.001 | 0.10 | 1.00 | 0 | 0.10 | 0.80 | 0 | 0.10 | 0.10 |
| 17 | Two or More Races, percent, July 1, 2016, (V2... | 1.60 | 7.30 | 2.80 | 2.00 | 3.80 | 3.00 | 2.30 | 2.60 | 2.10 | ... | 0.023 | 0.019 | 1.90 | 2.50 | 1.90 | 2.90 | 4.60 | 1.70 | 1.90 | 2.10 |
| 18 | Hispanic or Latino, percent, July 1, 2016, (V... | 4.20 | 7.00 | 30.90 | 7.30 | 38.90 | 21.30 | 15.70 | 9.20 | 24.90 | ... | 0.037 | 0.052 | 39.10 | 13.80 | 1.90 | 9.10 | 12.40 | 1.50 | 6.70 | 10.00 |
| 19 | White alone, not Hispanic or Latino, percent, ... | 65.80 | 61.20 | 55.50 | 72.90 | 37.70 | 68.60 | 67.70 | 62.90 | 54.90 | ... | 0.825 | 0.742 | 42.60 | 78.80 | 93.10 | 62.40 | 69.50 | 92.30 | 81.70 | 84.10 |
| 34 | High school graduate or higher, percent of per... | 84.30 | 92.10 | 86.00 | 84.80 | 81.80 | 90.70 | 89.90 | 88.40 | 86.90 | ... | 0.909 | 0.855 | 81.90 | 91.20 | 91.80 | 88.30 | 90.40 | 85.00 | 91.00 | 92.30 |
| 35 | Bachelor's degree or higher, percent of person... | 23.50 | 28.00 | 27.50 | 21.10 | 31.40 | 38.10 | 37.60 | 30.00 | 27.30 | ... | 0.27 | 0.249 | 27.60 | 31.10 | 36.00 | 36.30 | 32.90 | 19.20 | 27.80 | 25.70 |
| 47 | Median household income (in 2015 dollars), 201... | 43623 | 72515 | 50255 | 41371 | 61818 | 60629 | 70331 | 60509 | 47507 | ... | 50957 | 45219 | 53207 | 60727 | 55176 | 65015 | 61062 | 41751 | 53357 | 58840 |
| 48 | Per capita income in past 12 months (in 2015 d... | 24091 | 33413 | 25848 | 22798 | 30318 | 32217 | 38803 | 30554 | 26829 | ... | 26747 | 25227 | 26999 | 24686 | 29894 | 34152 | 31762 | 23450 | 28340 | 29803 |
18 rows × 51 columns
#to check row 10 if the string Z is replaced by 0.
census[(census.Fact == "Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016, (V2016)") ]
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | Native Hawaiian and Other Pacific Islander alo... | 0.10 | 1.30 | 0.30 | 0.30 | 0.50 | 0.20 | 0.10 | 0.10 | 0.10 | ... | 0.001 | 0.001 | 0.10 | 1.00 | 0 | 0.10 | 0.80 | 0 | 0.10 | 0.10 |
1 rows × 51 columns
# changing all the states data to numeric form
census = census[states].astype('float64')
#census['Fact'] = df_census_2016['Fact']
census.insert(loc=0, column='Fact', value=df_census['Fact'])
census
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300.0 | 741894.0 | 6931071.0 | 2988248.0 | 39250017.0 | 5540545.0 | 3576452.0 | 952065.0 | 20612439.0 | ... | 865454.000 | 6651194.000 | 27862596.0 | 3051217.0 | 624594.0 | 8411808.0 | 7288000.0 | 1831102.0 | 5778708.0 | 585501.0 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131.0 | 710249.0 | 6392301.0 | 2916025.0 | 37254522.0 | 5029324.0 | 3574114.0 | 897936.0 | 18804592.0 | ... | 814195.000 | 6346298.000 | 25146100.0 | 2763888.0 | 625741.0 | 8001041.0 | 6724545.0 | 1853011.0 | 5687289.0 | 563767.0 |
| 6 | Persons under 18 years, percent, July 1, 2016,... | 22.6 | 25.2 | 23.5 | 23.6 | 23.2 | 22.8 | 21.1 | 21.5 | 20.1 | ... | 0.246 | 0.226 | 26.2 | 30.2 | 19.0 | 22.2 | 22.4 | 20.5 | 22.3 | 23.7 |
| 7 | Persons under 18 years, percent, April 1, 2010 | 23.7 | 26.4 | 25.5 | 24.4 | 25.0 | 24.4 | 22.9 | 22.9 | 21.3 | ... | 0.249 | 0.236 | 27.3 | 31.5 | 20.7 | 23.2 | 23.5 | 20.9 | 23.6 | 24.0 |
| 8 | Persons 65 years and over, percent, July 1, 2... | 16.1 | 10.4 | 16.9 | 16.3 | 13.6 | 13.4 | 16.1 | 17.5 | 19.9 | ... | 0.160 | 0.157 | 12.0 | 10.5 | 18.1 | 14.6 | 14.8 | 18.8 | 16.1 | 15.0 |
| 9 | Persons 65 years and over, percent, April 1, 2010 | 13.8 | 7.7 | 13.8 | 14.4 | 11.4 | 10.9 | 14.2 | 14.4 | 17.3 | ... | 0.143 | 0.134 | 10.3 | 9.0 | 14.6 | 12.2 | 12.3 | 16.0 | 13.7 | 12.4 |
| 12 | White alone, percent, July 1, 2016, (V2016) | 69.3 | 66.1 | 83.3 | 79.4 | 72.7 | 87.5 | 80.6 | 70.1 | 77.6 | ... | 0.852 | 0.787 | 79.4 | 91.1 | 94.6 | 70.0 | 80.0 | 93.6 | 87.5 | 92.8 |
| 13 | Black or African American alone, percent, July... | 26.8 | 3.8 | 4.9 | 15.7 | 6.5 | 4.5 | 11.8 | 22.6 | 16.8 | ... | 0.020 | 0.171 | 12.6 | 1.4 | 1.3 | 19.8 | 4.1 | 3.6 | 6.6 | 1.3 |
| 14 | American Indian and Alaska Native alone, perce... | 0.7 | 15.2 | 5.4 | 1.0 | 1.7 | 1.6 | 0.5 | 0.6 | 0.5 | ... | 0.090 | 0.004 | 1.0 | 1.6 | 0.4 | 0.5 | 1.9 | 0.2 | 1.1 | 2.7 |
| 15 | Asian alone, percent, July 1, 2016, (V2016) | 1.4 | 6.3 | 3.4 | 1.6 | 14.8 | 3.3 | 4.7 | 4.0 | 2.9 | ... | 0.015 | 0.018 | 4.8 | 2.5 | 1.8 | 6.6 | 8.6 | 0.8 | 2.8 | 1.0 |
| 16 | Native Hawaiian and Other Pacific Islander alo... | 0.1 | 1.3 | 0.3 | 0.3 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1 | ... | 0.001 | 0.001 | 0.1 | 1.0 | 0.0 | 0.1 | 0.8 | 0.0 | 0.1 | 0.1 |
| 17 | Two or More Races, percent, July 1, 2016, (V2... | 1.6 | 7.3 | 2.8 | 2.0 | 3.8 | 3.0 | 2.3 | 2.6 | 2.1 | ... | 0.023 | 0.019 | 1.9 | 2.5 | 1.9 | 2.9 | 4.6 | 1.7 | 1.9 | 2.1 |
| 18 | Hispanic or Latino, percent, July 1, 2016, (V... | 4.2 | 7.0 | 30.9 | 7.3 | 38.9 | 21.3 | 15.7 | 9.2 | 24.9 | ... | 0.037 | 0.052 | 39.1 | 13.8 | 1.9 | 9.1 | 12.4 | 1.5 | 6.7 | 10.0 |
| 19 | White alone, not Hispanic or Latino, percent, ... | 65.8 | 61.2 | 55.5 | 72.9 | 37.7 | 68.6 | 67.7 | 62.9 | 54.9 | ... | 0.825 | 0.742 | 42.6 | 78.8 | 93.1 | 62.4 | 69.5 | 92.3 | 81.7 | 84.1 |
| 34 | High school graduate or higher, percent of per... | 84.3 | 92.1 | 86.0 | 84.8 | 81.8 | 90.7 | 89.9 | 88.4 | 86.9 | ... | 0.909 | 0.855 | 81.9 | 91.2 | 91.8 | 88.3 | 90.4 | 85.0 | 91.0 | 92.3 |
| 35 | Bachelor's degree or higher, percent of person... | 23.5 | 28.0 | 27.5 | 21.1 | 31.4 | 38.1 | 37.6 | 30.0 | 27.3 | ... | 0.270 | 0.249 | 27.6 | 31.1 | 36.0 | 36.3 | 32.9 | 19.2 | 27.8 | 25.7 |
| 47 | Median household income (in 2015 dollars), 201... | 43623.0 | 72515.0 | 50255.0 | 41371.0 | 61818.0 | 60629.0 | 70331.0 | 60509.0 | 47507.0 | ... | 50957.000 | 45219.000 | 53207.0 | 60727.0 | 55176.0 | 65015.0 | 61062.0 | 41751.0 | 53357.0 | 58840.0 |
| 48 | Per capita income in past 12 months (in 2015 d... | 24091.0 | 33413.0 | 25848.0 | 22798.0 | 30318.0 | 32217.0 | 38803.0 | 30554.0 | 26829.0 | ... | 26747.000 | 25227.000 | 26999.0 | 24686.0 | 29894.0 | 34152.0 | 31762.0 | 23450.0 | 28340.0 | 29803.0 |
18 rows × 51 columns
Now that all the rows are changed to numeric without any further error, next thing is look into the types of the data in census dataframe just to make sure if everything looks okay.
#confirming the types of the column
census.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 18 entries, 0 to 48 Data columns (total 51 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Fact 18 non-null object 1 Alabama 18 non-null float64 2 Alaska 18 non-null float64 3 Arizona 18 non-null float64 4 Arkansas 18 non-null float64 5 California 18 non-null float64 6 Colorado 18 non-null float64 7 Connecticut 18 non-null float64 8 Delaware 18 non-null float64 9 Florida 18 non-null float64 10 Georgia 18 non-null float64 11 Hawaii 18 non-null float64 12 Idaho 18 non-null float64 13 Illinois 18 non-null float64 14 Indiana 18 non-null float64 15 Iowa 18 non-null float64 16 Kansas 18 non-null float64 17 Kentucky 18 non-null float64 18 Louisiana 18 non-null float64 19 Maine 18 non-null float64 20 Maryland 18 non-null float64 21 Massachusetts 18 non-null float64 22 Michigan 18 non-null float64 23 Minnesota 18 non-null float64 24 Mississippi 18 non-null float64 25 Missouri 18 non-null float64 26 Montana 18 non-null float64 27 Nebraska 18 non-null float64 28 Nevada 18 non-null float64 29 New Hampshire 18 non-null float64 30 New Jersey 18 non-null float64 31 New Mexico 18 non-null float64 32 New York 18 non-null float64 33 North Carolina 18 non-null float64 34 North Dakota 18 non-null float64 35 Ohio 18 non-null float64 36 Oklahoma 18 non-null float64 37 Oregon 18 non-null float64 38 Pennsylvania 18 non-null float64 39 Rhode Island 18 non-null float64 40 South Carolina 18 non-null float64 41 South Dakota 18 non-null float64 42 Tennessee 18 non-null float64 43 Texas 18 non-null float64 44 Utah 18 non-null float64 45 Vermont 18 non-null float64 46 Virginia 18 non-null float64 47 Washington 18 non-null float64 48 West Virginia 18 non-null float64 49 Wisconsin 18 non-null float64 50 Wyoming 18 non-null float64 dtypes: float64(50), object(1) memory usage: 7.3+ KB
It shows that all the data is cleaned for further analysis. Now that all the column data for census column is cleaned with selection of the Fact column, next thing is to look into details about gun registration dataset.
df_guns.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12485 entries, 0 to 12484 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 12485 non-null object 1 state 12485 non-null object 2 permit 12461 non-null float64 3 permit_recheck 1100 non-null float64 4 handgun 12465 non-null float64 5 long_gun 12466 non-null float64 6 other 5500 non-null float64 7 multiple 12485 non-null int64 8 admin 12462 non-null float64 9 prepawn_handgun 10542 non-null float64 10 prepawn_long_gun 10540 non-null float64 11 prepawn_other 5115 non-null float64 12 redemption_handgun 10545 non-null float64 13 redemption_long_gun 10544 non-null float64 14 redemption_other 5115 non-null float64 15 returned_handgun 2200 non-null float64 16 returned_long_gun 2145 non-null float64 17 returned_other 1815 non-null float64 18 rentals_handgun 990 non-null float64 19 rentals_long_gun 825 non-null float64 20 private_sale_handgun 2750 non-null float64 21 private_sale_long_gun 2750 non-null float64 22 private_sale_other 2750 non-null float64 23 return_to_seller_handgun 2475 non-null float64 24 return_to_seller_long_gun 2750 non-null float64 25 return_to_seller_other 2255 non-null float64 26 totals 12485 non-null int64 dtypes: float64(23), int64(2), object(2) memory usage: 2.6+ MB
The data looks clean with month and state to be string and other columns are integers and floats.
#Looking into month column
df_guns.month
0 2017-09
1 2017-09
2 2017-09
3 2017-09
4 2017-09
...
12480 1998-11
12481 1998-11
12482 1998-11
12483 1998-11
12484 1998-11
Name: month, Length: 12485, dtype: object
Looking into the month column, it shows the year and the month. Therefore, it is converted to datetime object.
# converting month column to datetime
df_guns['month'] = pd.to_datetime(df_guns['month'], format="%Y-%m")
df_guns.month.dtype
dtype('<M8[ns]')
#looking into the info of the guns registration dataframe
df_guns.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12485 entries, 0 to 12484 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 12485 non-null datetime64[ns] 1 state 12485 non-null object 2 permit 12461 non-null float64 3 permit_recheck 1100 non-null float64 4 handgun 12465 non-null float64 5 long_gun 12466 non-null float64 6 other 5500 non-null float64 7 multiple 12485 non-null int64 8 admin 12462 non-null float64 9 prepawn_handgun 10542 non-null float64 10 prepawn_long_gun 10540 non-null float64 11 prepawn_other 5115 non-null float64 12 redemption_handgun 10545 non-null float64 13 redemption_long_gun 10544 non-null float64 14 redemption_other 5115 non-null float64 15 returned_handgun 2200 non-null float64 16 returned_long_gun 2145 non-null float64 17 returned_other 1815 non-null float64 18 rentals_handgun 990 non-null float64 19 rentals_long_gun 825 non-null float64 20 private_sale_handgun 2750 non-null float64 21 private_sale_long_gun 2750 non-null float64 22 private_sale_other 2750 non-null float64 23 return_to_seller_handgun 2475 non-null float64 24 return_to_seller_long_gun 2750 non-null float64 25 return_to_seller_other 2255 non-null float64 26 totals 12485 non-null int64 dtypes: datetime64[ns](1), float64(23), int64(2), object(1) memory usage: 2.6+ MB
From the info, we find that there are missing data in most of the columns except month, state and totals column. Let us further look into if there is any relationship of the "totals" column with other columns.
Totals column represents the total transactions of the guns registration dataframe. Let us further check into it.
#Creating a new dataframe (test) from "permit" column till "return to seller" column and see if it matches with the totals column.
test = df_guns.iloc[:, 2:-1]
test
| permit | permit_recheck | handgun | long_gun | other | multiple | admin | prepawn_handgun | prepawn_long_gun | prepawn_other | ... | returned_long_gun | returned_other | rentals_handgun | rentals_long_gun | private_sale_handgun | private_sale_long_gun | private_sale_other | return_to_seller_handgun | return_to_seller_long_gun | return_to_seller_other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16717.0 | 0.0 | 5734.0 | 6320.0 | 221.0 | 317 | 0.0 | 15.0 | 21.0 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 16.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 1 | 209.0 | 2.0 | 2320.0 | 2930.0 | 219.0 | 160 | 0.0 | 5.0 | 2.0 | 0.0 | ... | 30.0 | 0.0 | 0.0 | 0.0 | 17.0 | 24.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 5069.0 | 382.0 | 11063.0 | 7946.0 | 920.0 | 631 | 0.0 | 13.0 | 6.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 38.0 | 12.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2935.0 | 632.0 | 4347.0 | 6063.0 | 165.0 | 366 | 51.0 | 12.0 | 13.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 23.0 | 0.0 | 0.0 | 2.0 | 1.0 |
| 4 | 57839.0 | 0.0 | 37165.0 | 24581.0 | 2984.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12480 | 0.0 | NaN | 14.0 | 2.0 | NaN | 8 | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12481 | 1.0 | NaN | 65.0 | 286.0 | NaN | 8 | 1.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12482 | 3.0 | NaN | 149.0 | 251.0 | NaN | 5 | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12483 | 0.0 | NaN | 25.0 | 214.0 | NaN | 2 | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12484 | 8.0 | NaN | 45.0 | 49.0 | NaN | 5 | 0.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
12485 rows × 24 columns
#check the relationship of the sum of these columns with the totals column
(test.sum(axis=1) == df_guns.totals).all()
True
#also we can use assert statement
assert (test.sum(axis=1) == df_guns.totals).all()
Looking into these relation, it is confirmed that the totals column is the sum of all the individual transaction corresponding to each of the state. Now that the data is cleaned with the correct datatype, let us begin with exploring the data.
To answer this question, we will first look into the gun registration in all the states and narrow down to the states which has higher registration for analysis.
Because we find that totals is the sum of all the transactions and does not involve any missing data, therefore I will create a dataframe which has month, state and totals column for further analysis. Then I will look into the distribution in various states of US.
# creating guns registration dataframe
gun_dist = df_guns[['month', 'state', 'totals']]
gun_dist
| month | state | totals | |
|---|---|---|---|
| 0 | 2017-09-01 | Alabama | 32019 |
| 1 | 2017-09-01 | Alaska | 6303 |
| 2 | 2017-09-01 | Arizona | 28394 |
| 3 | 2017-09-01 | Arkansas | 17747 |
| 4 | 2017-09-01 | California | 123506 |
| ... | ... | ... | ... |
| 12480 | 1998-11-01 | Virginia | 24 |
| 12481 | 1998-11-01 | Washington | 361 |
| 12482 | 1998-11-01 | West Virginia | 408 |
| 12483 | 1998-11-01 | Wisconsin | 241 |
| 12484 | 1998-11-01 | Wyoming | 107 |
12485 rows × 3 columns
#looking into the dataset for year 2017 and september month
gun_dist[gun_dist['month'] == "2017-09-01"]
| month | state | totals | |
|---|---|---|---|
| 0 | 2017-09-01 | Alabama | 32019 |
| 1 | 2017-09-01 | Alaska | 6303 |
| 2 | 2017-09-01 | Arizona | 28394 |
| 3 | 2017-09-01 | Arkansas | 17747 |
| 4 | 2017-09-01 | California | 123506 |
| 5 | 2017-09-01 | Colorado | 35873 |
| 6 | 2017-09-01 | Connecticut | 12117 |
| 7 | 2017-09-01 | Delaware | 3502 |
| 8 | 2017-09-01 | District of Columbia | 61 |
| 9 | 2017-09-01 | Florida | 77390 |
| 10 | 2017-09-01 | Georgia | 35371 |
| 11 | 2017-09-01 | Guam | 135 |
| 12 | 2017-09-01 | Hawaii | 948 |
| 13 | 2017-09-01 | Idaho | 14938 |
| 14 | 2017-09-01 | Illinois | 116449 |
| 15 | 2017-09-01 | Indiana | 48524 |
| 16 | 2017-09-01 | Iowa | 9975 |
| 17 | 2017-09-01 | Kansas | 12856 |
| 18 | 2017-09-01 | Kentucky | 398706 |
| 19 | 2017-09-01 | Louisiana | 27821 |
| 20 | 2017-09-01 | Maine | 8715 |
| 21 | 2017-09-01 | Mariana Islands | 16 |
| 22 | 2017-09-01 | Maryland | 11255 |
| 23 | 2017-09-01 | Massachusetts | 15460 |
| 24 | 2017-09-01 | Michigan | 36944 |
| 25 | 2017-09-01 | Minnesota | 44552 |
| 26 | 2017-09-01 | Mississippi | 17491 |
| 27 | 2017-09-01 | Missouri | 37571 |
| 28 | 2017-09-01 | Montana | 10283 |
| 29 | 2017-09-01 | Nebraska | 5345 |
| 30 | 2017-09-01 | Nevada | 9460 |
| 31 | 2017-09-01 | New Hampshire | 10689 |
| 32 | 2017-09-01 | New Jersey | 7165 |
| 33 | 2017-09-01 | New Mexico | 12491 |
| 34 | 2017-09-01 | New York | 30703 |
| 35 | 2017-09-01 | North Carolina | 37325 |
| 36 | 2017-09-01 | North Dakota | 5212 |
| 37 | 2017-09-01 | Ohio | 49942 |
| 38 | 2017-09-01 | Oklahoma | 23371 |
| 39 | 2017-09-01 | Oregon | 27277 |
| 40 | 2017-09-01 | Pennsylvania | 77003 |
| 41 | 2017-09-01 | Puerto Rico | 791 |
| 42 | 2017-09-01 | Rhode Island | 1799 |
| 43 | 2017-09-01 | South Carolina | 25930 |
| 44 | 2017-09-01 | South Dakota | 7457 |
| 45 | 2017-09-01 | Tennessee | 50945 |
| 46 | 2017-09-01 | Texas | 128260 |
| 47 | 2017-09-01 | Utah | 20041 |
| 48 | 2017-09-01 | Vermont | 3084 |
| 49 | 2017-09-01 | Virgin Islands | 9 |
| 50 | 2017-09-01 | Virginia | 36446 |
| 51 | 2017-09-01 | Washington | 43049 |
| 52 | 2017-09-01 | West Virginia | 16723 |
| 53 | 2017-09-01 | Wisconsin | 37506 |
| 54 | 2017-09-01 | Wyoming | 4281 |
We find that there are 54 states and for each month and year (1998-2017), there is a gun registration data leading to 12485 rows. In census dataset, the states are 50. Let us first match the states from census and guns_dist dataframe and select the rows which match states in both dataframes.
# creating the dataframe with same states of census data. States list is already created initially while cleaning the data from the census dataset.
gun_dist = gun_dist.query('state in @states')
gun_dist
| month | state | totals | |
|---|---|---|---|
| 0 | 2017-09-01 | Alabama | 32019 |
| 1 | 2017-09-01 | Alaska | 6303 |
| 2 | 2017-09-01 | Arizona | 28394 |
| 3 | 2017-09-01 | Arkansas | 17747 |
| 4 | 2017-09-01 | California | 123506 |
| ... | ... | ... | ... |
| 12480 | 1998-11-01 | Virginia | 24 |
| 12481 | 1998-11-01 | Washington | 361 |
| 12482 | 1998-11-01 | West Virginia | 408 |
| 12483 | 1998-11-01 | Wisconsin | 241 |
| 12484 | 1998-11-01 | Wyoming | 107 |
11350 rows × 3 columns
This narrows down to 11350 rows from 12485 rows with states matching with census dataset.
Next thing I am looking is the distribution and see the counts belonging corresponding to the total gun registration.
# next thing is to plot histogram looking into the distribution of totals column in these states
gun_dist['totals'].hist(alpha=0.8, label="totals", bins=20, figsize=(8, 6))
plt.xlabel('Total Gun Registration',fontsize=15)
plt.ylabel("Count", fontsize=20)
plt.title("Count of gun registration for 50 states of US")
plt.legend()
plt.xticks([100000, 200000, 300000, 400000, 500000], ["100K", "200K", "300K", "400K", "500K"], rotation=45)
plt.tick_params(labelsize=12);
Looking into the histogram, most of the data lies within the range 100K. Let us further look into the statistics.
#mean of the totals of all the states: national mean
national_mean = gun_dist.totals.mean()
min_totals = gun_dist.totals.min()
max_totals = gun_dist.totals.max()
print(national_mean, min_totals, max_totals)
23734.978502202644 6 541978
The national mean is 23K whereas minimum is 6 while maximum is 541K. This means that data has higher variation. Let us further look into the box plot.
#Looking into box plot to get better understanding of the data distribution
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 15)
sns.boxplot(y='state', x="totals", data=gun_dist, orient="h", ax=ax)
plt.xlabel('Total',fontsize=15)
plt.ylabel("States", fontsize=15)
plt.title("Growth of gun registration for 50 states of US over the years 1998-2017", fontsize=22)
plt.xticks([100000, 200000, 300000, 400000, 500000], ["100K", "200K", "300K", "400K", "500K"], rotation=45)
plt.tick_params(labelsize=12);
Based on this visualisation, Kentucky has increased gun registration. Based on this, I have narrowed down to 10 states namely California, Florida, Illinois, Indiana, Kentucky, North Carolina, Texas, Pennsylvania, Ohio, Utah to see the distribution over the years.
Let us look closely at these states looking into the trend from 1998 till 2017.
#Dataframe for selected states
selected_states = [ "California", "Florida", "Illinois", "Indiana", "Kentucky", "North Carolina", "Texas", "Pennsylvania", "Ohio", "Utah"]
df_selected_states = gun_dist.query('state in @selected_states')
df_selected_states
| month | state | totals | |
|---|---|---|---|
| 4 | 2017-09-01 | California | 123506 |
| 9 | 2017-09-01 | Florida | 77390 |
| 14 | 2017-09-01 | Illinois | 116449 |
| 15 | 2017-09-01 | Indiana | 48524 |
| 18 | 2017-09-01 | Kentucky | 398706 |
| ... | ... | ... | ... |
| 12465 | 1998-11-01 | North Carolina | 1310 |
| 12467 | 1998-11-01 | Ohio | 952 |
| 12470 | 1998-11-01 | Pennsylvania | 17 |
| 12476 | 1998-11-01 | Texas | 2794 |
| 12477 | 1998-11-01 | Utah | 267 |
2270 rows × 3 columns
#Line plot showing in total distribution over the years in these selected states
#sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(15.7, 9.7)
sns.lineplot(y='totals', x="month", data=df_selected_states, hue='state', palette="cubehelix", ax=ax)
plt.legend(loc='upper left')
plt.xlabel('Year',fontsize=15)
plt.ylabel("Total Gun registration", fontsize=15)
plt.title('Trend in Gun registration over the years 1998-2017 for selected states', fontsize=25)
plt.yticks([100000, 200000, 300000, 400000, 500000], ["100K", "200K", "300K", "400K", "500K"])
plt.tick_params(labelsize=12);
Looking into the plot, following are points to note:
Kentucky shows the increase in the gun distribution over the years from 1998 to 2018 except in 2017 where it showed the dip.
North Carolina shows a sharp increase over 500K for the year 2014 while it is lower in number for the other years.
California shows the increase in the distribution mainly after 2012.
Texas shows some increase in 2013 and following years while Utah also show the increase for the years 2010-2012. Rest all looks like slight increase in gun registration over the years.
Now that we look into the plot, selecting only to these five states Kentucky, North Carolina, California, Texas and Utah gives the clear picture of the trend over the years.
#Selction of rows for these 5 selected states to further look into the trend over the years
five_states = ["Kentucky", "North Carolina", "California", "Texas", "Utah"]
df_five_states = gun_dist.query('state in @five_states')
df_five_states
| month | state | totals | |
|---|---|---|---|
| 4 | 2017-09-01 | California | 123506 |
| 18 | 2017-09-01 | Kentucky | 398706 |
| 35 | 2017-09-01 | North Carolina | 37325 |
| 46 | 2017-09-01 | Texas | 128260 |
| 47 | 2017-09-01 | Utah | 20041 |
| ... | ... | ... | ... |
| 12434 | 1998-11-01 | California | 2101 |
| 12448 | 1998-11-01 | Kentucky | 840 |
| 12465 | 1998-11-01 | North Carolina | 1310 |
| 12476 | 1998-11-01 | Texas | 2794 |
| 12477 | 1998-11-01 | Utah | 267 |
1135 rows × 3 columns
#Line plot of selected states
a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)
sns.lineplot(ax=ax, x="month", y='totals', data=df_five_states, style="state", hue='state')
plt.xlabel('Year',fontsize=20)
plt.ylabel("Totals in Gun reistration", fontsize=20)
plt.title('Growth in Gun registration over the years 1998-2017 for California, Kentucky, North Carolina, Utah and Texas', fontsize=25)
plt.yticks([100000, 200000, 300000, 400000, 500000], ["100K", "200K", "300K", "400K", "500K"], fontsize=20)
ax.tick_params(axis="x", labelsize=18, rotation=45)
ax.tick_params(axis="y", labelsize=20)
plt.show();
The plot shows that Utah and North Carolina has lower registration except for the years 2010-12 for Utah and 2014 for North Carolina.
Kentucky, California and Texas shows the increase in the gun registration and therefore we will look over these three states for subsequent analysis.
We find the states showing highest growth in gun registration, let us look further look into the population of these states and see if there is any impact of population with the gun registration.
For that, I am choosing three states, California, Kentucky and Texas which shows an increase in gun registration.
#looking into the dataset
census.head()
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300.0 | 741894.0 | 6931071.0 | 2988248.0 | 39250017.0 | 5540545.0 | 3576452.0 | 952065.0 | 20612439.0 | ... | 865454.000 | 6651194.000 | 27862596.0 | 3051217.0 | 624594.0 | 8411808.0 | 7288000.0 | 1831102.0 | 5778708.0 | 585501.0 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131.0 | 710249.0 | 6392301.0 | 2916025.0 | 37254522.0 | 5029324.0 | 3574114.0 | 897936.0 | 18804592.0 | ... | 814195.000 | 6346298.000 | 25146100.0 | 2763888.0 | 625741.0 | 8001041.0 | 6724545.0 | 1853011.0 | 5687289.0 | 563767.0 |
| 6 | Persons under 18 years, percent, July 1, 2016,... | 22.6 | 25.2 | 23.5 | 23.6 | 23.2 | 22.8 | 21.1 | 21.5 | 20.1 | ... | 0.246 | 0.226 | 26.2 | 30.2 | 19.0 | 22.2 | 22.4 | 20.5 | 22.3 | 23.7 |
| 7 | Persons under 18 years, percent, April 1, 2010 | 23.7 | 26.4 | 25.5 | 24.4 | 25.0 | 24.4 | 22.9 | 22.9 | 21.3 | ... | 0.249 | 0.236 | 27.3 | 31.5 | 20.7 | 23.2 | 23.5 | 20.9 | 23.6 | 24.0 |
| 8 | Persons 65 years and over, percent, July 1, 2... | 16.1 | 10.4 | 16.9 | 16.3 | 13.6 | 13.4 | 16.1 | 17.5 | 19.9 | ... | 0.160 | 0.157 | 12.0 | 10.5 | 18.1 | 14.6 | 14.8 | 18.8 | 16.1 | 15.0 |
5 rows × 51 columns
Only first two rows shows the distribution of population for the year 2010 and 2016, let us select those rows and for three cities: Kentucky, California and Texas.
#selection of the data set
#first getting the index position of columns Kentucky, California and Texas
for i, j in enumerate(census.columns):
print(i,j)
0 Fact 1 Alabama 2 Alaska 3 Arizona 4 Arkansas 5 California 6 Colorado 7 Connecticut 8 Delaware 9 Florida 10 Georgia 11 Hawaii 12 Idaho 13 Illinois 14 Indiana 15 Iowa 16 Kansas 17 Kentucky 18 Louisiana 19 Maine 20 Maryland 21 Massachusetts 22 Michigan 23 Minnesota 24 Mississippi 25 Missouri 26 Montana 27 Nebraska 28 Nevada 29 New Hampshire 30 New Jersey 31 New Mexico 32 New York 33 North Carolina 34 North Dakota 35 Ohio 36 Oklahoma 37 Oregon 38 Pennsylvania 39 Rhode Island 40 South Carolina 41 South Dakota 42 Tennessee 43 Texas 44 Utah 45 Vermont 46 Virginia 47 Washington 48 West Virginia 49 Wisconsin 50 Wyoming
#selection of dataset
census_pop = census.iloc[0:2, [0, 5,17,43]]
census_pop
| Fact | California | Kentucky | Texas | |
|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 39250017.0 | 4436974.0 | 27862596.0 |
| 1 | Population estimates base, April 1, 2010, (V2... | 37254522.0 | 4339344.0 | 25146100.0 |
Looking into the data, we want to change the columns states to rows and for Fact column, we need to extract the year and population and split in two seperate columns.
# Using replace for extract year and fact from the data
census_pop = census_pop.replace({'Fact': {'Population estimates, July 1, 2016, (V2016)': 'Population, 2016',
'Population estimates base, April 1, 2010, (V2016)' : 'Population, 2010'}})
census_pop
| Fact | California | Kentucky | Texas | |
|---|---|---|---|---|
| 0 | Population, 2016 | 39250017.0 | 4436974.0 | 27862596.0 |
| 1 | Population, 2010 | 37254522.0 | 4339344.0 | 25146100.0 |
#Then to rename Fact column to Fact,Year before we split into two columns
census_pop = census_pop.rename(columns={'Fact': "Fact, Year"})
census_pop
| Fact, Year | California | Kentucky | Texas | |
|---|---|---|---|---|
| 0 | Population, 2016 | 39250017.0 | 4436974.0 | 27862596.0 |
| 1 | Population, 2010 | 37254522.0 | 4339344.0 | 25146100.0 |
#splitting to two seperate columns
census_pop[['Fact', 'Year']] = census_pop['Fact, Year'].str.split(',', expand=True)
census_pop
| Fact, Year | California | Kentucky | Texas | Fact | Year | |
|---|---|---|---|---|---|---|
| 0 | Population, 2016 | 39250017.0 | 4436974.0 | 27862596.0 | Population | 2016 |
| 1 | Population, 2010 | 37254522.0 | 4339344.0 | 25146100.0 | Population | 2010 |
#now we will select only Year, states column of the dataframe
census_pop = census_pop[["Year", "California", "Kentucky", "Texas"]]
census_pop
| Year | California | Kentucky | Texas | |
|---|---|---|---|---|
| 0 | 2016 | 39250017.0 | 4436974.0 | 27862596.0 |
| 1 | 2010 | 37254522.0 | 4339344.0 | 25146100.0 |
#next thing is to use melt method and change to the rows and named dataframe pop
pop = pd.melt(census_pop, id_vars='Year', var_name='state', value_name='population')
pop
| Year | state | population | |
|---|---|---|---|
| 0 | 2016 | California | 39250017.0 |
| 1 | 2010 | California | 37254522.0 |
| 2 | 2016 | Kentucky | 4436974.0 |
| 3 | 2010 | Kentucky | 4339344.0 |
| 4 | 2016 | Texas | 27862596.0 |
| 5 | 2010 | Texas | 25146100.0 |
#looking into the data type
pop.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 6 non-null object 1 state 6 non-null object 2 population 6 non-null float64 dtypes: float64(1), object(2) memory usage: 272.0+ bytes
#changing Year to int object and again looking into the types
pop['Year'] = pop.Year.astype('int')
pop.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 6 non-null int64 1 state 6 non-null object 2 population 6 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 272.0+ bytes
#Creating dataframe for three states showing the registration data and year 2016
three_states = ["California", "Kentucky", "Texas"]
reg_2016 = gun_dist.query('state in @three_states & ((month >= "2016" & month <"2017"))')
reg_2016.shape
(36, 3)
#Creating dataframe for three states showing the registration data and year 2010
reg_2010 = gun_dist.query('state in @three_states & ((month >= "2010" & month <"2011"))')
reg_2010.shape
(36, 3)
#combined 2010 and 2016 dataframes
reg = pd.concat([reg_2010, reg_2016])
reg
| month | state | totals | |
|---|---|---|---|
| 4459 | 2010-12-01 | California | 81522 |
| 4473 | 2010-12-01 | Kentucky | 188188 |
| 4501 | 2010-12-01 | Texas | 126861 |
| 4514 | 2010-11-01 | California | 63995 |
| 4528 | 2010-11-01 | Kentucky | 181908 |
| ... | ... | ... | ... |
| 1063 | 2016-02-01 | Kentucky | 266913 |
| 1091 | 2016-02-01 | Texas | 192417 |
| 1104 | 2016-01-01 | California | 220912 |
| 1118 | 2016-01-01 | Kentucky | 302383 |
| 1146 | 2016-01-01 | Texas | 164088 |
72 rows × 3 columns
# extracting year and changing to datetime
reg['month'] = pd.to_datetime(reg['month'], format='%Y')
reg['Year'] = reg['month'].dt.year
#reg['Year']
reg = reg[['Year', 'state', 'totals']]
reg
| Year | state | totals | |
|---|---|---|---|
| 4459 | 2010 | California | 81522 |
| 4473 | 2010 | Kentucky | 188188 |
| 4501 | 2010 | Texas | 126861 |
| 4514 | 2010 | California | 63995 |
| 4528 | 2010 | Kentucky | 181908 |
| ... | ... | ... | ... |
| 1063 | 2016 | Kentucky | 266913 |
| 1091 | 2016 | Texas | 192417 |
| 1104 | 2016 | California | 220912 |
| 1118 | 2016 | Kentucky | 302383 |
| 1146 | 2016 | Texas | 164088 |
72 rows × 3 columns
#confirming the selection for the year
reg.Year.unique()
array([2010, 2016])
#looking into the types of the columns for each of the dataframe
reg.info()
pop.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 72 entries, 4459 to 1146 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 72 non-null int64 1 state 72 non-null object 2 totals 72 non-null int64 dtypes: int64(2), object(1) memory usage: 2.2+ KB <class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year 6 non-null int64 1 state 6 non-null object 2 population 6 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 272.0+ bytes
#combining pop and refined guns registration data set
df = pop.merge(reg, how='inner', on=('Year', 'state'))
df
| Year | state | population | totals | |
|---|---|---|---|---|
| 0 | 2016 | California | 39250017.0 | 298161 |
| 1 | 2016 | California | 39250017.0 | 224039 |
| 2 | 2016 | California | 39250017.0 | 180968 |
| 3 | 2016 | California | 39250017.0 | 159077 |
| 4 | 2016 | California | 39250017.0 | 176770 |
| ... | ... | ... | ... | ... |
| 67 | 2010 | Texas | 25146100.0 | 61789 |
| 68 | 2010 | Texas | 25146100.0 | 65012 |
| 69 | 2010 | Texas | 25146100.0 | 77795 |
| 70 | 2010 | Texas | 25146100.0 | 87849 |
| 71 | 2010 | Texas | 25146100.0 | 73787 |
72 rows × 4 columns
#creating a mask for year 2010 and 2016
pop_2010 = df.query('Year == 2010')
pop_2016 = df.query('Year == 2016')
pop_2016.shape
(36, 4)
Now that we got the dataframe for the year 2010 and 2016 with population and gun registration. Let us look into how gun registration varies with population for these states.
#distribution of gun registration with population in 2010 using pairplot
plt.rcParams['figure.figsize']=5,5
sns.pairplot(pop_2010, vars=['population', 'totals'], hue='state', palette='husl', height=3)
plt.title('Population and gun distribution pattern in 2010', y=2.1, fontsize = 16)
plt.show();
#distribution of gun registration with population in 2016
sns.pairplot(pop_2016, vars=['population', 'totals'], hue='state', palette='husl', height=3)
plt.title('Population and gun distribution pattern in 2016', y=2.1, fontsize = 16)
plt.show();
Summary from the pairplot graphs for states California, Kentucky and Texas for the years 2010 and 2016
The population data from census column is given one data point for the complete year while gun registration data is available for all 12 months for the year 2010 and 2016. This is shown for each of the 12 points corresponding to the population data.
The plots clearly shows that the total gun registration is higher in 2016 as compared to 2010 for all these states.
The other point is for Kentucky, the gun registration has increased from 2010 to 2016 although the population is not that high compared to other states California and Texas.
For California and Texas, there is increase in population and the gun registration as well. Although this figure gives some insight about the gun registration pattern with population, more census data is required to provide better insights about the correlation.
Looking into the statistics for years 2010 and 2016:
#Mean gun registration and population for 2016
summary_2010 = pop_2010.groupby('state')['population', 'totals'].mean()
summary_2010
| population | totals | |
|---|---|---|
| state | ||
| California | 37254522.0 | 68033.250000 |
| Kentucky | 4339344.0 | 198798.250000 |
| Texas | 25146100.0 | 80672.583333 |
#Mean gun registration and population for 2016
summary_2016 = pop_2016.groupby('state')['population', 'totals'].mean()
summary_2016
| population | totals | |
|---|---|---|
| state | ||
| California | 39250017.0 | 198097.250000 |
| Kentucky | 4436974.0 | 306403.916667 |
| Texas | 27862596.0 | 143477.166667 |
# net change from 2010
inc = summary_2016 - summary_2010
inc
| population | totals | |
|---|---|---|
| state | ||
| California | 1995495.0 | 130064.000000 |
| Kentucky | 97630.0 | 107605.666667 |
| Texas | 2716496.0 | 62804.583333 |
There is an increase in gun registration and population from 2010 to 2016. However, California and Kentucky shows an increase of more than 100K registration although population of Kentucky increase by 97K while population in California increase by approx 200K.
census
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Population estimates, July 1, 2016, (V2016) | 4863300.0 | 741894.0 | 6931071.0 | 2988248.0 | 39250017.0 | 5540545.0 | 3576452.0 | 952065.0 | 20612439.0 | ... | 865454.000 | 6651194.000 | 27862596.0 | 3051217.0 | 624594.0 | 8411808.0 | 7288000.0 | 1831102.0 | 5778708.0 | 585501.0 |
| 1 | Population estimates base, April 1, 2010, (V2... | 4780131.0 | 710249.0 | 6392301.0 | 2916025.0 | 37254522.0 | 5029324.0 | 3574114.0 | 897936.0 | 18804592.0 | ... | 814195.000 | 6346298.000 | 25146100.0 | 2763888.0 | 625741.0 | 8001041.0 | 6724545.0 | 1853011.0 | 5687289.0 | 563767.0 |
| 6 | Persons under 18 years, percent, July 1, 2016,... | 22.6 | 25.2 | 23.5 | 23.6 | 23.2 | 22.8 | 21.1 | 21.5 | 20.1 | ... | 0.246 | 0.226 | 26.2 | 30.2 | 19.0 | 22.2 | 22.4 | 20.5 | 22.3 | 23.7 |
| 7 | Persons under 18 years, percent, April 1, 2010 | 23.7 | 26.4 | 25.5 | 24.4 | 25.0 | 24.4 | 22.9 | 22.9 | 21.3 | ... | 0.249 | 0.236 | 27.3 | 31.5 | 20.7 | 23.2 | 23.5 | 20.9 | 23.6 | 24.0 |
| 8 | Persons 65 years and over, percent, July 1, 2... | 16.1 | 10.4 | 16.9 | 16.3 | 13.6 | 13.4 | 16.1 | 17.5 | 19.9 | ... | 0.160 | 0.157 | 12.0 | 10.5 | 18.1 | 14.6 | 14.8 | 18.8 | 16.1 | 15.0 |
| 9 | Persons 65 years and over, percent, April 1, 2010 | 13.8 | 7.7 | 13.8 | 14.4 | 11.4 | 10.9 | 14.2 | 14.4 | 17.3 | ... | 0.143 | 0.134 | 10.3 | 9.0 | 14.6 | 12.2 | 12.3 | 16.0 | 13.7 | 12.4 |
| 12 | White alone, percent, July 1, 2016, (V2016) | 69.3 | 66.1 | 83.3 | 79.4 | 72.7 | 87.5 | 80.6 | 70.1 | 77.6 | ... | 0.852 | 0.787 | 79.4 | 91.1 | 94.6 | 70.0 | 80.0 | 93.6 | 87.5 | 92.8 |
| 13 | Black or African American alone, percent, July... | 26.8 | 3.8 | 4.9 | 15.7 | 6.5 | 4.5 | 11.8 | 22.6 | 16.8 | ... | 0.020 | 0.171 | 12.6 | 1.4 | 1.3 | 19.8 | 4.1 | 3.6 | 6.6 | 1.3 |
| 14 | American Indian and Alaska Native alone, perce... | 0.7 | 15.2 | 5.4 | 1.0 | 1.7 | 1.6 | 0.5 | 0.6 | 0.5 | ... | 0.090 | 0.004 | 1.0 | 1.6 | 0.4 | 0.5 | 1.9 | 0.2 | 1.1 | 2.7 |
| 15 | Asian alone, percent, July 1, 2016, (V2016) | 1.4 | 6.3 | 3.4 | 1.6 | 14.8 | 3.3 | 4.7 | 4.0 | 2.9 | ... | 0.015 | 0.018 | 4.8 | 2.5 | 1.8 | 6.6 | 8.6 | 0.8 | 2.8 | 1.0 |
| 16 | Native Hawaiian and Other Pacific Islander alo... | 0.1 | 1.3 | 0.3 | 0.3 | 0.5 | 0.2 | 0.1 | 0.1 | 0.1 | ... | 0.001 | 0.001 | 0.1 | 1.0 | 0.0 | 0.1 | 0.8 | 0.0 | 0.1 | 0.1 |
| 17 | Two or More Races, percent, July 1, 2016, (V2... | 1.6 | 7.3 | 2.8 | 2.0 | 3.8 | 3.0 | 2.3 | 2.6 | 2.1 | ... | 0.023 | 0.019 | 1.9 | 2.5 | 1.9 | 2.9 | 4.6 | 1.7 | 1.9 | 2.1 |
| 18 | Hispanic or Latino, percent, July 1, 2016, (V... | 4.2 | 7.0 | 30.9 | 7.3 | 38.9 | 21.3 | 15.7 | 9.2 | 24.9 | ... | 0.037 | 0.052 | 39.1 | 13.8 | 1.9 | 9.1 | 12.4 | 1.5 | 6.7 | 10.0 |
| 19 | White alone, not Hispanic or Latino, percent, ... | 65.8 | 61.2 | 55.5 | 72.9 | 37.7 | 68.6 | 67.7 | 62.9 | 54.9 | ... | 0.825 | 0.742 | 42.6 | 78.8 | 93.1 | 62.4 | 69.5 | 92.3 | 81.7 | 84.1 |
| 34 | High school graduate or higher, percent of per... | 84.3 | 92.1 | 86.0 | 84.8 | 81.8 | 90.7 | 89.9 | 88.4 | 86.9 | ... | 0.909 | 0.855 | 81.9 | 91.2 | 91.8 | 88.3 | 90.4 | 85.0 | 91.0 | 92.3 |
| 35 | Bachelor's degree or higher, percent of person... | 23.5 | 28.0 | 27.5 | 21.1 | 31.4 | 38.1 | 37.6 | 30.0 | 27.3 | ... | 0.270 | 0.249 | 27.6 | 31.1 | 36.0 | 36.3 | 32.9 | 19.2 | 27.8 | 25.7 |
| 47 | Median household income (in 2015 dollars), 201... | 43623.0 | 72515.0 | 50255.0 | 41371.0 | 61818.0 | 60629.0 | 70331.0 | 60509.0 | 47507.0 | ... | 50957.000 | 45219.000 | 53207.0 | 60727.0 | 55176.0 | 65015.0 | 61062.0 | 41751.0 | 53357.0 | 58840.0 |
| 48 | Per capita income in past 12 months (in 2015 d... | 24091.0 | 33413.0 | 25848.0 | 22798.0 | 30318.0 | 32217.0 | 38803.0 | 30554.0 | 26829.0 | ... | 26747.000 | 25227.000 | 26999.0 | 24686.0 | 29894.0 | 34152.0 | 31762.0 | 23450.0 | 28340.0 | 29803.0 |
18 rows × 51 columns
#selection of dataset
census_income = census.iloc[-2:, [0, 5, 17,43]]
census_income
| Fact | California | Kentucky | Texas | |
|---|---|---|---|---|
| 47 | Median household income (in 2015 dollars), 201... | 61818.0 | 43740.0 | 53207.0 |
| 48 | Per capita income in past 12 months (in 2015 d... | 30318.0 | 24063.0 | 26999.0 |
#looking into the details of the fact column showing the year
census_income.Fact.unique()
array(['Median household income (in 2015 dollars), 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015'],
dtype=object)
# Using replace for extract year and fact from the data
census_income = census_income.replace({'Fact': {'Median household income (in 2015 dollars), 2011-2015': 'household_income, 2011-2015',
'Per capita income in past 12 months (in 2015 dollars), 2011-2015' : 'percapita_income, 2011-2015'}})
census_income
| Fact | California | Kentucky | Texas | |
|---|---|---|---|---|
| 47 | household_income, 2011-2015 | 61818.0 | 43740.0 | 53207.0 |
| 48 | percapita_income, 2011-2015 | 30318.0 | 24063.0 | 26999.0 |
#rename the Fact column
census_income = census_income.rename(columns={'Fact': "Fact, Year"})
census_income
| Fact, Year | California | Kentucky | Texas | |
|---|---|---|---|---|
| 47 | household_income, 2011-2015 | 61818.0 | 43740.0 | 53207.0 |
| 48 | percapita_income, 2011-2015 | 30318.0 | 24063.0 | 26999.0 |
#splitting to two seperate columns
census_income[['Fact', 'Year']] = census_income['Fact, Year'].str.split(',', expand=True)
census_income
| Fact, Year | California | Kentucky | Texas | Fact | Year | |
|---|---|---|---|---|---|---|
| 47 | household_income, 2011-2015 | 61818.0 | 43740.0 | 53207.0 | household_income | 2011-2015 |
| 48 | percapita_income, 2011-2015 | 30318.0 | 24063.0 | 26999.0 | percapita_income | 2011-2015 |
#now we will select only Year, states column of the dataframe
census_income = census_income[["Fact", "California", "Kentucky", "Texas"]]
census_income
| Fact | California | Kentucky | Texas | |
|---|---|---|---|---|
| 47 | household_income | 61818.0 | 43740.0 | 53207.0 |
| 48 | percapita_income | 30318.0 | 24063.0 | 26999.0 |
#next thing is to use melt method and change to the rows
income = pd.melt(census_income, id_vars='Fact', var_name='state', value_name='income')
income
| Fact | state | income | |
|---|---|---|---|
| 0 | household_income | California | 61818.0 |
| 1 | percapita_income | California | 30318.0 |
| 2 | household_income | Kentucky | 43740.0 |
| 3 | percapita_income | Kentucky | 24063.0 |
| 4 | household_income | Texas | 53207.0 |
| 5 | percapita_income | Texas | 26999.0 |
#next thing is to filter the rows of guns registration dataset for the years 2011 till 2015
#combining with guns registration dataset for 2016
three_states = ["California", "Kentucky", "Texas"]
reg_2015 = gun_dist.query('state in @three_states & ((month >= "2011" & month <"2016"))')
reg_2015 = reg_2015[['state', 'totals']]
reg_2015
| state | totals | |
|---|---|---|
| 1159 | California | 252946 |
| 1173 | Kentucky | 308824 |
| 1201 | Texas | 241791 |
| 1214 | California | 180116 |
| 1228 | Kentucky | 295891 |
| ... | ... | ... |
| 4363 | Kentucky | 188375 |
| 4391 | Texas | 98234 |
| 4404 | California | 68984 |
| 4418 | Kentucky | 177464 |
| 4446 | Texas | 85642 |
180 rows × 2 columns
#combining income and refined guns registration data set for the years 2011 till 2015
df_income = income.merge(reg_2015, how='inner', on=('state'))
df_income
| Fact | state | income | totals | |
|---|---|---|---|---|
| 0 | household_income | California | 61818.0 | 252946 |
| 1 | household_income | California | 61818.0 | 180116 |
| 2 | household_income | California | 61818.0 | 163211 |
| 3 | household_income | California | 61818.0 | 125441 |
| 4 | household_income | California | 61818.0 | 125122 |
| ... | ... | ... | ... | ... |
| 355 | percapita_income | Texas | 26999.0 | 81733 |
| 356 | percapita_income | Texas | 26999.0 | 87448 |
| 357 | percapita_income | Texas | 26999.0 | 97274 |
| 358 | percapita_income | Texas | 26999.0 | 98234 |
| 359 | percapita_income | Texas | 26999.0 | 85642 |
360 rows × 4 columns
#Looking into the california dataset to check if merge is done correctly
df_income.query('(state == "California") & (Fact == "household_income")')
| Fact | state | income | totals | |
|---|---|---|---|---|
| 0 | household_income | California | 61818.0 | 252946 |
| 1 | household_income | California | 61818.0 | 180116 |
| 2 | household_income | California | 61818.0 | 163211 |
| 3 | household_income | California | 61818.0 | 125441 |
| 4 | household_income | California | 61818.0 | 125122 |
| 5 | household_income | California | 61818.0 | 140365 |
| 6 | household_income | California | 61818.0 | 112704 |
| 7 | household_income | California | 61818.0 | 119574 |
| 8 | household_income | California | 61818.0 | 114686 |
| 9 | household_income | California | 61818.0 | 199833 |
| 10 | household_income | California | 61818.0 | 113789 |
| 11 | household_income | California | 61818.0 | 113292 |
| 12 | household_income | California | 61818.0 | 164170 |
| 13 | household_income | California | 61818.0 | 132469 |
| 14 | household_income | California | 61818.0 | 128439 |
| 15 | household_income | California | 61818.0 | 118856 |
| 16 | household_income | California | 61818.0 | 119570 |
| 17 | household_income | California | 61818.0 | 104309 |
| 18 | household_income | California | 61818.0 | 102639 |
| 19 | household_income | California | 61818.0 | 120661 |
| 20 | household_income | California | 61818.0 | 134448 |
| 21 | household_income | California | 61818.0 | 137660 |
| 22 | household_income | California | 61818.0 | 111341 |
| 23 | household_income | California | 61818.0 | 100054 |
| 24 | household_income | California | 61818.0 | 138522 |
| 25 | household_income | California | 61818.0 | 109140 |
| 26 | household_income | California | 61818.0 | 114595 |
| 27 | household_income | California | 61818.0 | 104067 |
| 28 | household_income | California | 61818.0 | 111132 |
| 29 | household_income | California | 61818.0 | 100114 |
| 30 | household_income | California | 61818.0 | 109153 |
| 31 | household_income | California | 61818.0 | 112464 |
| 32 | household_income | California | 61818.0 | 110149 |
| 33 | household_income | California | 61818.0 | 117106 |
| 34 | household_income | California | 61818.0 | 108666 |
| 35 | household_income | California | 61818.0 | 133187 |
| 36 | household_income | California | 61818.0 | 150805 |
| 37 | household_income | California | 61818.0 | 113832 |
| 38 | household_income | California | 61818.0 | 89164 |
| 39 | household_income | California | 61818.0 | 79077 |
| 40 | household_income | California | 61818.0 | 87291 |
| 41 | household_income | California | 61818.0 | 80683 |
| 42 | household_income | California | 61818.0 | 85384 |
| 43 | household_income | California | 61818.0 | 83820 |
| 44 | household_income | California | 61818.0 | 90368 |
| 45 | household_income | California | 61818.0 | 107573 |
| 46 | household_income | California | 61818.0 | 89188 |
| 47 | household_income | California | 61818.0 | 75418 |
| 48 | household_income | California | 61818.0 | 89592 |
| 49 | household_income | California | 61818.0 | 74118 |
| 50 | household_income | California | 61818.0 | 83926 |
| 51 | household_income | California | 61818.0 | 70043 |
| 52 | household_income | California | 61818.0 | 76162 |
| 53 | household_income | California | 61818.0 | 62387 |
| 54 | household_income | California | 61818.0 | 75317 |
| 55 | household_income | California | 61818.0 | 66457 |
| 56 | household_income | California | 61818.0 | 87052 |
| 57 | household_income | California | 61818.0 | 84183 |
| 58 | household_income | California | 61818.0 | 67480 |
| 59 | household_income | California | 61818.0 | 68984 |
While looking into the California dataset, because we have only one household income for every state from the census dataset which is $61818, therefore all the income corresponding to the total gun registration for years 2011-2015 is the same.
# to show there is only one household income value for each state from census dataset
census[census.Fact == 'Median household income (in 2015 dollars), 2011-2015']
| Fact | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | Florida | ... | South Dakota | Tennessee | Texas | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | Median household income (in 2015 dollars), 201... | 43623.0 | 72515.0 | 50255.0 | 41371.0 | 61818.0 | 60629.0 | 70331.0 | 60509.0 | 47507.0 | ... | 50957.0 | 45219.0 | 53207.0 | 60727.0 | 55176.0 | 65015.0 | 61062.0 | 41751.0 | 53357.0 | 58840.0 |
1 rows × 51 columns
# Creating mask for household income
household = df_income.Fact == 'household_income'
household
0 True
1 True
2 True
3 True
4 True
...
355 False
356 False
357 False
358 False
359 False
Name: Fact, Length: 360, dtype: bool
# Creating mask for per capita income
per_capita = df_income.Fact == 'percapita_income'
per_capita
0 False
1 False
2 False
3 False
4 False
...
355 True
356 True
357 True
358 True
359 True
Name: Fact, Length: 360, dtype: bool
# Data representing the median guns registration and per capita income (in USD) for the years 2011-2015.
df_income[per_capita].groupby('state')['totals', 'income'].median()
| totals | income | |
|---|---|---|
| state | ||
| California | 109651.0 | 30318.0 |
| Kentucky | 214013.5 | 24063.0 |
| Texas | 113715.0 | 26999.0 |
The statistics shows the per capita income for Kentucky is least, however the gun registration is maximum. California has high per capita income but gun registration is least compared to other two states.
# Data representing the median guns registration and houshold income (in USD) for the years 2011-2015.
df_income[household].groupby('state')['totals', 'income'].median()
| totals | income | |
|---|---|---|
| state | ||
| California | 109651.0 | 61818.0 |
| Kentucky | 214013.5 | 43740.0 |
| Texas | 113715.0 | 53207.0 |
Similarly for median household income is least for Kentucky, however the gun registration is maximum. California has high median income but gun registration is least compared to other two states.
# Plot showing the relationship of income with the registration
sns.jointplot(x="income", y="totals", data=df_income.query('Fact == "household_income"'), hue='state', palette='husl')
plt.suptitle('Gun registration and median household income for year 2011-2015', y=1)
plt.subplots_adjust(hspace = 1.8)
plt.show();
# Joint plot showing relationship between total gun registration and per capita income
sns.jointplot(x="income", y="totals", data=df_income.query('Fact == "percapita_income"'), hue='state', palette='husl')
plt.suptitle('Gun registration and per capita income for year 2011-2015', y=1)
plt.subplots_adjust(hspace = 1.8)
plt.show();
Both jointplots for median household income and per capita income (in USD) shows that Kentucky has higher gun registration although the income is least.
Results
Exploratory data analysis of these datasets highlights the trend for the guns registration over the years for all the states of US.
Interesting results are observed for Kentucky showing the higher gun registration over the years. Further analysis is done with the population and income insights.
The data also shows that North Carolina shows low gun registration except for the year 2014 which shows more than 500K registrations.
Other states have also higher gun registration over the years although the total is less than 100K.
Limitations
The descriptive statistics of population and income with the gun registration for different states provide some insights, however the correlation is weak. This means that further hypothesis testing needs to be done to prove if there is strong correlation between those variables.
Various transactions for gun registration has a lot of missing data. More detailed insights would have been derived if we used different statistical ways for filling those null values and find the correlation using hypothesis testing.
Also, as we can see that there is only one data point corresponding to each Fact column of census. More data on census would have provided a clear picture about these correlations.